ABSTRACT

A description is given of a general method for
studying partitions. The main focus is the analysis of
relationships among several different partitions of the same items
for the exploration as well as confirmation of structural
relationships. A partition is defined as a set of mutually exclusive
clusters of items; however, this paper deals with overlapping or
conjoint clusters also. The terms "clustering" and "partitioning" are
found in the context of both data collection and analysis. The major
impetus for the present method was the problem of studying how an
individual (manifest) partition might be examined in relation to a
single derived (latent) partition in the context of latent partition
analysis (LPA). In the process of studying characteristics of the
distribution of the principal statistic, there was also developed an
approach to the organization of a set of partitioned data with
respect to any specified target. Data were analyzed first with
respect to an a priori target which had been generated by the
investigators on the basis of characteristics of the Morse Code
symbols, with no regard for the letters themselves. It was found
that: (1) the organization and display features of the proposed
strategy may be of greatest significance, especially when stimuli or
items are complex and when sorters are heterogeneous in terms of how
they formed their categories; and (2) there are several other
possible indices of association for comparing matrices. (CK)

Whenever anyone studies a variety of items, objects, or
variables it is almost inevitable that at some stage he will
cluster them into categories. Such clustering seems to be a
fundamental part of any search for knowledge since it occurs
whenever we name objects or distinguish among different classes
of things. It is unnecessary to document here that
clustering procedures have been used in practically every
area of science and that a wide variety of methods have been
employed. Yet, there have been few systematic reports on the
characteristics of methods for clustering or partitioning data.
The purpose of this paper is to describe a general method for
studying partitions which appears to us to have special potential
for research in the behavioral and social sciences. Our
major concern is with the analysis of relationships among
several different partitions of the same items for the exploration
as well as confirmation of structural relationships. In
this paper, a partition is defined as a set of mutually exclusive
or disjoint clusters of items, but our approach may be rather
simply extended to deal with overlapping or conjoint clusters.
Clusters within a single partition are taken to have no
quantitative ordering; such clusters are considered as merely
qualitatively distinct.

The terms "clustering" and "partitioning" are to be found
in the context of both data collection and analysis. It may be
helpful to provide some methodological perspective on various
uses for these terms, and on the variety of methods which have
been developed. Perhaps the most common use of the term
"clustering" has been in the context of data analysis where
N objects or variables have been used to generate all of the
N(N-1)/2 pairwise measures of association from which some type
of dimensional analysis is initiated. The methods of Hartigan
(1967) for tree structure analysis, Johnson (1967) for hierarchical
clustering, and the Guttman (1968), Kruskal (1964a, b),
and Shepard (1962) approaches to multidimensional scaling all
initially require such pairwise measures of relatedness (e.g.,
proximity, similarity, distance, correlation). Closely related
in certain ways are the methods of parametric mapping (Shepard
and Carroll, 1966) and multidimensional unfolding (Coombs, 1950;
Barnett and Hays, 1960). Kruskal (1969), incidentally, has
reviewed such methods and provided some new insights as to their
similarities and differences. Harris and Kaiser (1964) and
Tryon (1958) have both used the term "clustering" in the
context of factor analysis, although they have used the word
in somewhat different ways. Hartigan has also begun work on
simultaneous clustering of rows and columns of rectangular
score matrices; while as yet unpublished, this may be an important
new direction for clustering methods since it involves a direct
approach to organizing and understanding relationships among
observations. Some additional sources have also been included
in our References to facilitate further examination of the
literature.

The terms "partitioning" and "clustering" have often been
used interchangeably in the context of data collection and
analysis. Wiley's (1967) latent partition analysis (LPA)
defines "manifest partitions" at the level of data collection
as well as "latent partitions" at the level of analysis. LPA
has served to stimulate a number of studies involving further 
elaboration and refinement of partitioning methodology (see 
INDEX - SIG Sort, Archives and Evans (1970)). In fact, the 
work reported here was motivated in part by what were seen as 
problems or limitations with the LPA approach. 

The major impetus for the present method was the problem
of studying how an individual (manifest) partition might be
examined in relation to a single derived (latent) partition
in the context of LPA. But we began to find ourselves interested
in a more basic problem in the analysis of partitions.
Instead of examining relationships among individual partitions
of a set of items and some type of average partition such as
the type which is produced in LPA applications, we decided that
a means should be sought of measuring the goodness (or badness)
of fit of any single partition of N items to what we shall call
a "target" partition. Such a measure was devised and it has
been found to be useful for both the exploration and confirmation
of structural relationships among a set of N items. In the
process of studying characteristics of the distribution of our
principal statistic, we have also developed what we see to be
a useful approach to the organization of a set of partitioned
data with respect to any specified target. The remainder of
this paper describes details of the proposed method and
applies it in the analysis of data.

Suppose, initially, that we have $N$ items which have been
partitioned into $n_s$ categories by each of $S$ sorters ($s = 1, \ldots, S$).
For each sorter we may construct a binary matrix of order $N \times N$,
the entries of which index joint occurrence (1) and non-occurrence (0)
of the various item pairs for this sorter. Let us call such a matrix $A_s$.
Next, consider a binary matrix labeled $A_t$, which is analogous to
$A_s$, except that it has been constructed as a model or "target"
partition for the set of $N$ items. $A_t$ is also a joint-occurrence
matrix but it is taken as fixed; it might typically correspond
to an experimenter's hypothesis about some "cue" system which
sorters have used in partitioning the items. Ultimately, we
shall also consider the possibility that any number, $T$, of targets
may have been specified a priori so that the index $t$ of $A_t$ could
range over $1, \ldots, T$.
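
As an illustration only, the construction of a joint-occurrence matrix might be sketched as follows. The function name `joint_occurrence` and the five-item example partition are our own assumptions and do not come from the study; a partition is represented simply as a list of nominal category labels, one per item.

```python
def joint_occurrence(labels):
    """Return the N x N binary joint-occurrence matrix for one partition.

    labels[i] is the nominal category assigned to item i. Entry (i, j)
    is 1 when items i and j were placed in the same category and 0
    otherwise, so every diagonal entry is 1.
    """
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


# Hypothetical sorter: five items placed into three categories.
sorter = ["a", "a", "b", "c", "b"]
A_s = joint_occurrence(sorter)
for row in A_s:
    print(row)
```

The same function would serve for a target partition, since a target is treated as a fixed matrix of the same form.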

Also let $k_1, \ldots, k_{n_s}$ be the frequencies of items in the
$s$th sorter's partition and $m_1, \ldots, m_{n_t}$ be the corresponding
frequencies for the $t$th target. Later we shall have occasion
to use functions defined from these numbers as

$$h(A_s) = \frac{\sum_i k_i (k_i - 1)}{N(N-1)} \quad \text{and} \quad h(A_t) = \frac{\sum_j m_j (m_j - 1)}{N(N-1)},$$

where these two quantities may be called the partition heights.
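
A minimal sketch of the height computation, assuming a partition is stored as a list of category labels (the function name `height` and the example data are hypothetical):

```python
from collections import Counter


def height(labels):
    """Partition height: sum of k(k - 1) over categories, divided by N(N - 1).

    This is the proportion of ordered off-diagonal item pairs that
    fall in the same category.
    """
    n = len(labels)
    sizes = Counter(labels).values()  # category frequencies k_1, ..., k_n
    return sum(k * (k - 1) for k in sizes) / (n * (n - 1))


# Hypothetical partition with category frequencies 2, 2 and 1.
print(height(["a", "a", "b", "c", "b"]))

# Limiting cases: one all-inclusive category, and all singletons.
print(height(["a"] * 5), height(list(range(5))))
```

The limiting cases show why the term "height" is apt: a single all-inclusive category has height 1, and a partition into singletons has height 0.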



When one's object is to study a set of partitions in
relation to one or more targets, it becomes useful to assess
the degree of association between $A_s$ and $A_t$ using a summary
statistic. One such statistic which has been found efficacious
is the quantity

$$q_{st} = \frac{\operatorname{tr}\left[(A_s - A_t)^2\right]}{N(N-1)} \tag{1}$$

where the numerator on the right-hand side represents the total
number of discrepancies between $A_s$ and $A_t$. Since all diagonal
elements of both matrices are necessarily unity, it follows
that $N(N-1)$ is generally the maximum value of the trace; thus
$q_{st}$ might be regarded as a normalized trace which ranges between
0 and 1. $q_{st}$ is directly interpretable as a badness of fit
statistic for the pair $(A_s, A_t)$; the statistic is symmetric
in that permutations of $(A_s, A_t)$ to $(A_t, A_s)$ do not change
its value; and it is easily computed as the proportion of
disagreements between $A_s$ and $A_t$. Also, it can be shown that

$$q_{st} = h(A_s) + h(A_t) - 2h(A_s \cap A_t) \tag{2}$$

where $(A_s \cap A_t)$ corresponds to the element-wise product, or
intersection, of the pair $(A_s, A_t)$. A goodness of fit statistic
might be written as $p_{st} = 1 - q_{st}$.
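
The badness of fit statistic and its decomposition into heights can be checked numerically. The sketch below is illustrative only: the function names and the two five-item partitions are assumptions of ours. Pairing each item's two labels is our own device for obtaining the intersection partition, since two items share a pair of labels exactly when they are joined in both partitions.

```python
from collections import Counter


def height(labels):
    """Proportion of ordered off-diagonal pairs sharing a category."""
    n = len(labels)
    return sum(k * (k - 1) for k in Counter(labels).values()) / (n * (n - 1))


def q_stat(s, t):
    """Proportion of ordered off-diagonal pairs on which s and t disagree."""
    n = len(s)
    disagreements = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and (s[i] == s[j]) != (t[i] == t[j])
    )
    return disagreements / (n * (n - 1))


# Hypothetical sorter and target partitions of the same five items.
sorter = ["a", "a", "b", "c", "b"]
target = ["x", "x", "x", "y", "y"]

q = q_stat(sorter, target)
both = list(zip(sorter, target))  # intersection partition
rhs = height(sorter) + height(target) - 2 * height(both)

print(q, rhs)   # the two values agree, up to rounding
print(1 - q)    # goodness of fit
```

Working from label lists rather than full matrices avoids storing the $N \times N$ arrays, which is a convenience and not a change to the statistic.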

Substantial investigation has already been made of the
distributional properties of $q_{st}$. It has been possible to write
a computer program to generate the exact distribution of $q_{st}$ for
any fixed target where $h(A_s)$ is also fixed for $A_s$. Combinatorial
equations are used to generate all possible outcomes for $A_s$ for
any $A_t$; the program may then be used to compute each of the
corresponding values of $q_{st}$. The expectation of $q_{st}$ has been found
to be particularly simple:

$$E(q_{st}) = h(A_s) + h(A_t) - 2h(A_s)\,h(A_t) \tag{3}$$

More detailed results for the exact distribution are not included
here, however, because the computer time required to find the exact
distribution of $q_{st}$ actually exceeds the time required for a
moderately large Monte Carlo distribution of $q_{st}$. Moreover, $q_{st}$
has been found to be virtually Gaussian as $N$ grows to 15 or 20 items
or more for most "realistic" $h(A_s)$, $h(A_t)$ combinations. Tables 4a
and 4b include two such Monte Carlo distributions of $q_{st}$. As $N$
grows above $N = 20$ for fixed heights $h(A_s)$ and $h(A_t)$, the
standard error of $q_{st}$ will necessarily decrease.
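
A Monte Carlo distribution of this kind might be generated along the following lines. This is a sketch under stated assumptions and not the program used in the study: we assume the null model allocates items to the sorter's categories at random with the category frequencies held fixed (which fixes the height), and the twelve-item partitions, the number of draws, and the seed are hypothetical.

```python
import random
import statistics
from collections import Counter


def height(labels):
    n = len(labels)
    return sum(k * (k - 1) for k in Counter(labels).values()) / (n * (n - 1))


def q_stat(s, t):
    n = len(s)
    bad = sum(1 for i in range(n) for j in range(n)
              if i != j and (s[i] == s[j]) != (t[i] == t[j]))
    return bad / (n * (n - 1))


def monte_carlo_q(sorter, target, draws=2000, seed=0):
    """Simulate q under random allocation of items to the sorter's categories.

    Shuffling the label list keeps every category frequency, and hence
    the height, fixed (an assumed null model).
    """
    rng = random.Random(seed)
    shuffled = list(sorter)
    values = []
    for _ in range(draws):
        rng.shuffle(shuffled)
        values.append(q_stat(shuffled, target))
    return values


# Hypothetical 12-item example: four categories of 3 versus three of 4.
sorter = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
target = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

values = monte_carlo_q(sorter, target)
expected = (height(sorter) + height(target)
            - 2 * height(sorter) * height(target))

print("simulated mean:", statistics.mean(values))
print("expected value:", expected)
print("estimated standard error:", statistics.stdev(values))
```

Under this null model the simulated mean should lie close to the expectation obtained from the two heights, and the standard deviation of the simulated values serves as an estimate of the standard error of the statistic.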
In order to apply the proposed method for the analysis
of data, an experiment was designed in which 50 graduate
students were asked to partition a set of 26 items, each of
which was a pairing of a letter from the alphabet and its own
Morse Code equivalent. Free sort instructions were used, with
the only stipulation being that each student should form
categories of items which in some sense might be thought of
as mutually homogeneous. Sorters were thus allowed to choose
their own system for partitioning; the intention was to
help ensure that several different bases would be chosen for
the partitions. All students were asked to use no fewer than
five, and no more than nine, categories.

The data were analyzed first with respect to an a priori
target which had been generated by the investigators on the
basis of characteristics of the Morse Code symbols, with no
regard for the letters themselves. Thus, the aim was to test
the hypothesis for each individual that his partition is random
with respect to the target; this is the null hypothesis. Rejection
of this hypothesis can be taken as constituting evidence,
at some specified probability level, that the partition in question
may not reasonably be regarded as randomly different from
the target. A sufficiently small value of $q_{st}$ (large
value of $p_{st}$) is taken as evidence that the target in question
can reasonably be regarded as having been the model in some
sense for the individual's manifest partition. As with all
procedures for inductive inference, the interpretation should
be qualified in that both Type I and Type II errors are possible.

Table 1 is a display of the original data, organized by
columns and rows with respect to Target 1. Columns, representing
items, were simply grouped according to the eight a priori
target clusters. Rows, corresponding to sorters, were ranked,
first by the number of categories which had been formed and,
next, by using the statistic $q_{st}$, the normalized $\operatorname{tr}\left[(A_s - A_t)^2\right]$.
Also, the height coefficients were calculated from the respective
sets of category frequencies for the sorters. For each sorter
the nominal category numbers for the original partitions were
then printed.
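
The organization of such a display can be sketched briefly. The code below is illustrative only: the six-item target, the three sorters, and the choice of ascending order for both ranking keys are assumptions of ours, since the direction of ranking is not stated.

```python
def q_stat(s, t):
    """Proportion of ordered off-diagonal pairs on which s and t disagree."""
    n = len(s)
    bad = sum(1 for i in range(n) for j in range(n)
              if i != j and (s[i] == s[j]) != (t[i] == t[j]))
    return bad / (n * (n - 1))


def organize(partitions, target):
    """Order columns by target cluster; rank rows by category count, then q."""
    column_order = sorted(range(len(target)), key=lambda i: target[i])
    rows = sorted(
        (len(set(labels)), q_stat(labels, target), sorter_id)
        for sorter_id, labels in partitions.items()
    )
    return column_order, rows


# Hypothetical target and manifest partitions for six items.
target = [1, 2, 1, 3, 2, 3]
partitions = {
    "s1": [1, 2, 1, 3, 2, 3],
    "s2": [1, 1, 1, 2, 2, 2],
    "s3": [1, 2, 3, 3, 2, 1],
}

column_order, rows = organize(partitions, target)
for n_categories, q, sorter_id in rows:
    labels = partitions[sorter_id]
    shown = " ".join(str(labels[i]) for i in column_order)
    print(sorter_id, n_categories, round(q, 3), shown)
```

Each printed row gives the sorter, the number of categories, the statistic, and the sorter's nominal category numbers with items arranged by target cluster, so that departures from the target can be read directly from the row.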

These results show that five of the sorters (Nos. 30, 42,
46, 48, and 18) apparently used the same cues for their
partitions as were used in generating this target; as expected,
the associated $q_{st}$ (or trace) statistics are zero. Moreover, a
number of other sorters had minimal confusions with respect to
this target. For individuals with eight or nine category
partitions, the theoretically expected value of the $q_{st}$ statistic
is approximately .18. Since the sampling distribution of $q_{st}$
tends to have only mild negative skewness (for eight category
partitions with "small" heights) and the estimated standard error
(from Monte Carlo data) of $q_{st}$ is .04, it might be argued that
persons with $q_{st} < .10$ have not randomly sorted with respect
to the target but, rather, have in some sense used the model
of the target partition. Inspections of the individual sorters'
partitions are readily made using tables such as this one.
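
The reasoning behind the .10 cutoff can be made explicit by expressing it in standard-error units, using only the expected value (.18) and estimated standard error (.04) quoted above. This is a rough check on our part, not a computation reported with the data:

```latex
z = \frac{q_{st} - E(q_{st})}{\widehat{\mathrm{SE}}(q_{st})}
  = \frac{.10 - .18}{.04} = -2
```

A value of .10 thus lies two standard errors below expectation. If the distribution were exactly Gaussian, roughly 2 percent of random partitions would fall below this point; given the mild skewness noted above, that figure should be read only as an approximation.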
Further analyses were conducted using another target which
was constructed from the results of an LPA analysis. Table 2
includes a contingency table with which the a priori and the LPA
(target) partitions may be compared. Table 3 is an analogue of
Table 1 for the LPA-derived target, which has been labeled
Target 2.

As might be expected, the $q_{st}$ values tend generally to be
smaller for Target 2 than Target 1 ($q_{\cdot 1} = .128$, $q_{\cdot 2} = .123$), but
it remains interesting that none of the $q_{s2}$ values are identically
zero. Despite its overall similarity to this set of sorter
partitions, the derived LPA partition does not seem precisely
to have been any single person's model in the sense that Target 1
apparently was. Again, further examinations of the rows of
Tables 1 and 3 might be useful for further exploration of the
data; contingency tables such as that of Table 2 might also
facilitate more detailed study.

Were we to have specified other target partitions with
respect to manifest partitions of these 26 alphabetic-Morse-Code
combinations, the data could of course just as easily have been
organized and displayed for each target. We suggest that for
complex items generally, where several different schemes might
have been employed to generate manifest partitions, the use of
multiple targets can provide an efficient and
thorough analysis of one's basic data. Comparisons of statistics
of the form of $q_{st}$, or vectors of the form $q_s = (q_{s1}, q_{s2}, \ldots, q_{sT})$,
with other measures at the level of the sorters (or groups of
sorters) can be developed as direct analogues of standard methods
for analysing quantitative data.
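
As a sketch of how such vectors might be assembled, each sorter can be given a profile of badness of fit values, one per target. The two targets, the two sorters, and the function name `q_profile` below are hypothetical and serve only to show the bookkeeping.

```python
def q_stat(s, t):
    """Proportion of ordered off-diagonal pairs on which s and t disagree."""
    n = len(s)
    bad = sum(1 for i in range(n) for j in range(n)
              if i != j and (s[i] == s[j]) != (t[i] == t[j]))
    return bad / (n * (n - 1))


def q_profile(labels, targets):
    """Vector of q values for one sorter across all T targets."""
    return [q_stat(labels, t) for t in targets]


# Two hypothetical targets for six items.
targets = [
    [1, 1, 1, 2, 2, 2],
    [1, 2, 1, 2, 1, 2],
]
# Two hypothetical sorters.
sorters = {
    "s1": [1, 1, 1, 2, 2, 2],
    "s2": [1, 2, 1, 2, 1, 3],
}

profiles = {sid: q_profile(labels, targets) for sid, labels in sorters.items()}
for sid, profile in profiles.items():
    # Index (starting from zero) of the target this sorter fits best.
    best = min(range(len(profile)), key=lambda t: profile[t])
    print(sid, [round(q, 3) for q in profile], "closest target index:", best)
```

The resulting $S \times T$ array of values is an ordinary quantitative data matrix, which is what allows it to be related to other sorter-level measures by standard methods.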

Finally, it might be helpful to make some further, more
general points about the suggested strategy for studying partitions
in relation to other methods which are available. Wiley's
(1967) latent partition analysis and Johnson's (1967) hierarchical
clustering have both proven especially useful for explorations
of partitions. Although the present method might also be used
for exploration as well as explanation, we see the new approach
as complementing, not competing with, the earlier ones. As we
have shown above, the earlier methods may be used to generate
targets and, hence, to organize the set of partitions. The new
strategy, based on more direct analysis of the partitions themselves,
however, does seem to have more potential for refining
one's understanding of the bases which have been used for sorting;
there seems to be a substantial advantage for interpretation, and
sharpening of hypotheses for future research, for those methods
which do not "impose" a latent model, such as that of LPA. The
organization and display features of the proposed strategy may
be of greatest significance, especially when stimuli or items
are complex and when sorters are heterogeneous in terms of how
they formed their categories.

Finally, it should be clear that there are several other
possible indices of association for comparing matrices of the
form of $A_s$ and $A_t$. Evans (1970) provides one, and others are
implicit in his work. Also, S. C. Johnson of Bell Telephone
Laboratories has suggested several, including one equivalent to
our $q_{st}$, in an unpublished manuscript ("Metric Clustering",
circa 1970). The latter paper, in fact, which was discovered
after completing most of the present work, includes a number of
suggestions which anticipated our own (although Johnson does
not use his version of our $q_{st}$ extensively). Nevertheless, we
suspect that there is more to gain from critical applications
of the available methods than from continued comparison and
refinement of methods which, to date, have only rarely been used.